In [2]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt 
import cufflinks as cf
import plotly.offline as py
import plotly.graph_objs as go
# low_memory=False makes pandas read the whole file before inferring dtypes, which avoids the mixed-type DtypeWarning
df_mcr=pd.read_csv("mcr.csv",sep=",",low_memory=False)
cf.go_offline()
py.init_notebook_mode()
df_mcr.head(15)

Out[2]:
Time from Start to Finish (seconds) Q1 Q1_OTHER_TEXT Q2 Q3 Q4 Q5 Q6 Q6_OTHER_TEXT Q7 ... Q49_OTHER_TEXT Q50_Part_1 Q50_Part_2 Q50_Part_3 Q50_Part_4 Q50_Part_5 Q50_Part_6 Q50_Part_7 Q50_Part_8 Q50_OTHER_TEXT
0 Duration (in seconds) What is your gender? - Selected Choice What is your gender? - Prefer to self-describe... What is your age (# years)? In which country do you currently reside? What is the highest level of formal education ... Which best describes your undergraduate major?... Select the title most similar to your current ... Select the title most similar to your current ... In what industry is your current employer/cont... ... What tools and methods do you use to make your... What barriers prevent you from making your wor... What barriers prevent you from making your wor... What barriers prevent you from making your wor... What barriers prevent you from making your wor... What barriers prevent you from making your wor... What barriers prevent you from making your wor... What barriers prevent you from making your wor... What barriers prevent you from making your wor... What barriers prevent you from making your wor...
1 710 Female -1 45-49 United States of America Doctoral degree Other Consultant -1 Other ... -1 NaN NaN NaN NaN NaN NaN NaN NaN -1
2 434 Male -1 30-34 Indonesia Bachelor’s degree Engineering (non-computer focused) Other 0 Manufacturing/Fabrication ... -1 NaN NaN NaN NaN NaN NaN NaN NaN -1
3 718 Female -1 30-34 United States of America Master’s degree Computer science (software engineering, etc.) Data Scientist -1 I am a student ... -1 NaN Too time-consuming NaN NaN NaN NaN NaN NaN -1
4 621 Male -1 35-39 United States of America Master’s degree Social sciences (anthropology, psychology, soc... Not employed -1 NaN ... -1 NaN NaN Requires too much technical knowledge NaN Not enough incentives to share my work NaN NaN NaN -1
5 731 Male -1 22-24 India Master’s degree Mathematics or statistics Data Analyst -1 I am a student ... -1 NaN Too time-consuming NaN NaN Not enough incentives to share my work NaN NaN NaN -1
6 1142 Male -1 25-29 Colombia Bachelor’s degree Physics or astronomy Data Scientist -1 Computers/Technology ... -1 NaN NaN NaN Afraid that others will use my work without gi... NaN I had never considered making my work easier f... NaN NaN -1
7 959 Male -1 35-39 Chile Doctoral degree Information technology, networking, or system ... Other 1 Academics/Education ... -1 Too expensive NaN NaN NaN NaN I had never considered making my work easier f... NaN NaN -1
8 1758 Male -1 18-21 India Master’s degree Information technology, networking, or system ... Other 2 Other ... -1 NaN NaN NaN NaN Not enough incentives to share my work NaN NaN NaN -1
9 641 Male -1 25-29 Turkey Master’s degree Engineering (non-computer focused) Not employed -1 NaN ... -1 NaN NaN NaN NaN NaN NaN NaN NaN -1
10 751 Male -1 30-34 Hungary Master’s degree Engineering (non-computer focused) Software Engineer -1 Online Service/Internet-based Services ... -1 NaN Too time-consuming NaN Afraid that others will use my work without gi... NaN NaN NaN NaN -1
11 2028 Male -1 22-24 Ireland Bachelor’s degree Information technology, networking, or system ... Student -1 I am a student ... -1 NaN Too time-consuming NaN NaN NaN NaN NaN NaN -1
12 823 Male -1 40-44 United States of America Master’s degree Engineering (non-computer focused) Data Scientist -1 Other ... -1 NaN Too time-consuming NaN NaN Not enough incentives to share my work NaN NaN NaN -1
13 1091 Male -1 25-29 France Doctoral degree Mathematics or statistics Student -1 I am a student ... -1 NaN NaN NaN NaN NaN NaN None of these reasons apply to me NaN -1
14 1917 Male -1 25-29 United States of America Bachelor’s degree Mathematics or statistics Research Assistant -1 Academics/Education ... -1 NaN Too time-consuming NaN NaN NaN NaN NaN NaN -1

15 rows × 395 columns
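
Row 0 of the table above holds the question wording rather than an answer, which is why the later cells slice with `[1:]`. A minimal sketch of separating the two, run on a tiny made-up stand-in for `mcr.csv` (the names `questions` and `responses` are illustrative and not part of the notebook):

```python
import io
import pandas as pd

# Tiny stand-in for mcr.csv: header row, then a row of question text, then answers.
raw = io.StringIO(
    "Time from Start to Finish (seconds),Q1,Q2\n"
    "Duration (in seconds),What is your gender? - Selected Choice,What is your age (# years)?\n"
    "710,Female,45-49\n"
    "434,Male,30-34\n"
)
df = pd.read_csv(raw)

# Row 0 holds the question wording; keep it as a lookup from column name to question.
questions = df.iloc[0]
# Everything after row 0 is an actual response.
responses = df.iloc[1:].reset_index(drop=True)
# With the text row gone, the duration column can be converted to numbers.
responses["Time from Start to Finish (seconds)"] = pd.to_numeric(
    responses["Time from Start to Finish (seconds)"]
)
```

Keeping `questions` around makes it easy to look up what a column such as `Q6` actually asks before choosing a chart title.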

Gender Distribution of the Kaggle Survey Participants

In [3]:
import numpy as np
# Keep only the respondents who answered "Male" or "Female"
mask1=df_mcr["Q1"]=="Male"
mask2=df_mcr["Q1"]=="Female"
c=df_mcr[mask1|mask2]["Q1"].value_counts()
# go.Data is deprecated; pass a plain list of traces instead
data=[go.Bar(x=c.index,y=c.values,orientation="v")]
layout=go.Layout(height=800,title="Gender Distribution of the Survey")
fig=go.Figure(data,layout)
py.iplot(fig)

AGE DISTRIBUTION OF THE KAGGLE SURVEY PARTICIPANTS

In [6]:
# Skip row 0 (the question text) and drop unanswered rows
df_age=df_mcr["Q2"][1:].dropna()
a=df_age.value_counts()
# go.Data is deprecated; pass a plain list of traces instead
data=[go.Bar(x=a.index,y=a.values,orientation="v",
             marker=dict(color='rgb(158,202,225)',
                         line=dict(color='rgb(8,48,107)',width=1.5)))]
layout=go.Layout(height=800,title="Age of the Survey Participants")
fig=go.Figure(data,layout)
py.iplot(fig)
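
Note that `value_counts()` orders its result by frequency, so the age brackets in the chart are not necessarily in ascending order. A minimal sketch on made-up data of putting the brackets back in order with `sort_index()`; it assumes the labels sort correctly as plain strings, which holds for labels shaped like "18-21" or "25-29":

```python
import pandas as pd

# Made-up age brackets standing in for the Q2 answers.
ages = pd.Series(["25-29", "18-21", "25-29", "30-34", "25-29", "18-21"])

by_frequency = ages.value_counts()      # most common bracket first
by_bracket = by_frequency.sort_index()  # brackets in ascending label order
```

Passing `by_bracket.index` and `by_bracket.values` to `go.Bar` would then draw the bars from the youngest bracket to the oldest.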


COUNTRIES PARTICIPATING IN THE SURVEY

In [9]:
# Skip row 0 (the question text), then filter the Q3 column only instead of masking the whole DataFrame
q3=df_mcr["Q3"][1:]
df_country=q3[q3!="I do not wish to disclose my location"]
c=df_country.value_counts()
# go.Data is deprecated; pass a plain list of traces instead
data=[go.Bar(x=c.index,y=c.values,orientation="v",
             marker=dict(color='#20716A',opacity=0.5))]
layout=go.Layout(height=900,title="Countries of the Survey Participants")
figure=go.Figure(data,layout)
py.iplot(figure)


In [5]:
plt.figure(figsize=(12,12))
df_degree=df_mcr["Q4"][1:].dropna()
df_degree.value_counts().plot(kind="bar")
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x290909a9358>
In [10]:
field_of_study=df_mcr["Q5"][1:].value_counts()
# Donut chart: a pie trace with a hole, plus a text annotation on the figure
fig={
    "data":[{
        "labels":field_of_study.index,
        "values":field_of_study.values,
        "textposition":"inside",
        "hole":0.5,
        "type":"pie"
    }],
    "layout":{
        "title":"Undergraduate Major",
        "annotations":[{
            "showarrow":False,
            "text":"Field of Study",
            "font":{
                "size":25
            }
        }]
    }
}
py.iplot(fig)

The chart above shows how diverse the fields that use data science as an analytical tool are. In business, for example, algorithms are used to predict stock prices and detect fraud. In medicine, a data science team at the University of California San Diego implemented an optimized deep learning model to detect anomalies in the human eye from images alone. In physics and astronomy, CERN is a prime example of how important data is: the LHC collects petabytes of data from the thousands of particle collisions that happen every second at its core.

In [12]:
# Q6 asks for the respondent's current job title, not the field of study
fs=df_mcr["Q6"][1:].value_counts()
# go.Data is deprecated; pass a plain list of traces instead
data=[go.Bar(x=fs.index,y=fs.values,orientation="v")]
layout=go.Layout(title="Current Job Title",height=500)
fig=go.Figure(data,layout)
py.iplot(fig)
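
The cells in this notebook repeat one pattern: skip the question-text row, drop missing answers, and count what remains. A minimal sketch of wrapping that pattern in a helper; `answer_counts` and its `exclude` parameter are hypothetical names introduced here for illustration, and the demo data is made up:

```python
import pandas as pd

def answer_counts(df, column, exclude=()):
    """Count answers in `column`, skipping the question-text row (row 0)."""
    answers = df[column].iloc[1:].dropna()
    # Remove any answers the caller does not want charted.
    answers = answers[~answers.isin(list(exclude))]
    return answers.value_counts()

# Made-up stand-in for the Q3 column of the survey.
demo = pd.DataFrame({
    "Q3": [
        "In which country do you currently reside?",
        "India", "India", "France",
        "I do not wish to disclose my location", None,
    ]
})
counts = answer_counts(demo, "Q3", exclude=["I do not wish to disclose my location"])
```

The returned Series can be passed to `go.Bar` the same way as in the cells above, via `counts.index` and `counts.values`.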